20:07
2026-06-28
swelljoe.com
large-language-models
Shell Games
A new benchmark test of Ornith 1.0, a model that builds its own task scaffolds, found that providing a full shell and Python environment doubled its bug-finding performance without increasing false po…